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Abstract. It has recently become popular to define treatment effects 
for subsets of the target population characterized by variables not ob- 
servable at the time a treatment decision is made. Characterizing and 
estimating such treatment effects is tricky; the most popular but naive 
approach inappropriately adjusts for variables affected by treatment 
and so is biased. We consider several appropriate ways to formalize 
the effects: principal stratification, stratification on a single potential 
auxiliary variable, stratification on an observed auxiliary variable and 
stratification on expected levels of auxiliary variables. We then outline 
identifying assumptions for each type of estimand. We evaluate the util- 
ity of these estimands and estimation procedures for decision making 
and understanding causal processes, contrasting them with the con- 
cepts of direct and indirect effects. We motivate our development with 
examples from nephrology and cancer screening, and use simulated data 
and real data on cancer screening to illustrate the estimation methods. 

Key words and phrases: Causality, direct effects, interaction, effect 
modification, bias, principal stratification. 



1. INTRODUCTION 

In the recent literature on causal inference, it has 
become popular to define treatment effects for sub- 
sets of the target population characterized by vari- 
ables not observable at the time a treatment deci- 
sion is made. The most popular framework for do- 
ing this is principal stratification (PS); this name 
was introduced in a unifying paper by Frangakis 
and Rubin (2002). The ideas have been applied to 
a broad range of problems, including censoring by 
death (Robins, 1986; Zhang and Rubin, 2003), non- 
compliance in randomized trials (Angrist, Imbens 
and Rubin, 1996; Baker and Lindeman, 1994), the 
estimation of the effects of vaccines on post-infection 
outcomes (Gilbert, Bosch and Hudgens, 2003) and 
surrogate outcomes in randomized trials (Frangakis 
and Rubin, 2002). As we shall see, PS is one of sev- 
eral possible ways to define these effects. 

The reasons for interest in effects defined by post- 
treatment auxiliary variables are diverse. We con- 
sider two problems, one in nephrology and one in 
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cancer screening, to provide motivation for inter- 
est in these various estimands and estimation pro- 
cedures. 

Nephrologists have been frustrated by the lack of 
a good means to lower the high rates of morbid- 
ity and mortality among patients after the onset of 
end-stage renal disease (ESRD), the point at which 
kidney disease has progressed sufficiently to require 
dialysis treatment. Suggested methods to reduce this 
morbidity and mortality, which have included bet- 
ter or more aggressive control of anemia and higher 
doses of dialysis, have shown disappointing results 
(Paniagua, Amato, Vonesh, Correa-Rotter, Ramos, 
Moran and Mujais, 2002; Eknoyan, Beck, Cheung, 
Daugirdas, Greene, Kusek, Allon, Bailey, Delmez, 
Depner, Dwyer, Levey, Levin, Milford, Ornt, Rocco, 
Schulman, Schwab, Teehan and Toto, 2002; Besarab, 
Bolton, Browne, Egrie, Nissenson, Okamoto, Schwab 
and Goodkin, 1998). Frustration with the inability 
of treatments or interventions given after the initia- 
tion of dialysis to affect the course of ESRD has led 
some to hypothesize that the period before ESRD 
develops may provide a window of opportunity to 
improve outcomes among ESRD patients. The hope 
is that interventions or treatments applied before 
ESRD develops may affect the clinical course after 
development of ESRD; that is, it is hypothesized 
that treatments given before ESRD affect outcomes 
after the development of ESRD. 

This apparently simple hypothesis is surprisingly 
hard to formalize. The difficulty stems from the fact 
that not all subjects with advanced chronic renal 
insufficiency (CRI) progress to ESRD, and that the 
same interventions that may affect outcomes after 
the onset of ESRD may themselves help determine 
who will develop ESRD. After considering a naive 
approach and illustrating its difficulties (Section 2), 
this paper considers several ways to formalize the 
hypothesis; for each, we briefly consider approaches 
to estimation of relevant and causally meaningful 
parameters (Section 3). 

A second motivating problem concerns evaluation 
of the efficacy of cancer screening. Successful meth- 
ods for screening for cancer result in earlier diagnosis 
of cancer (or precancerous conditions); this early de- 
tection may lead to treatment of the cancer while the 
cancer is still curable and so to reduced mortality. 
Randomized trials have been used to evaluate cancer 
screening; often, people randomized to screen fail to 
comply with their assignment. It is expected that 



any benefit of assignment to screening would be re- 
stricted to women who are screened. It might further 
be surmised that any benefit of screening would be 
restricted to screened women diagnosed with breast 
cancer and that the benefit of screening would be 
restricted even further to women whose cancer was 
diagnosed as a result of the screen. For explanatory 
purposes, it is of interest to estimate the benefit of 
screening for women in these subgroups. Hypotheses 
here may be formulated in ways similar to those in 
the nephrology problem. In these data, the outcome 
is failure-time, a censored continuous variable. 

In Section 4 we consider statistical inference. We 
consider ranges of assumptions that identify the var- 
ious estimands, as well as methods of estimation. 
Where necessary, we concentrate on continuous out- 
comes, as some methods for inference are more 
straightforward here. In Section 5 we consider a sim- 
ulation experiment to provide some comparison of 
inference under various approaches. In Section 6 we 
analyze data about cancer screening from the Health 
Insurance Plan (HIP) Study (Shapiro, Venet, Strax 
and Venet, 1988), considering various estimands of 
interest. 

In Section 7 we evaluate and compare the various 
approaches in terms of their utility for decision mak- 
ing and explanation. In addition, we compare the 
approaches conditioning treatment effects on post- 
treatment auxiliary variables to the estimation of di- 
rect and indirect effects. The paper concludes with 
discussion of extensions of the estimands and meth- 
ods to more complex settings. 

2. MOTIVATING EXAMPLE: DATA AND 
SIMPLE ANALYSIS 

2.1 Data and Potential Outcomes 

We motivate our methodological development with 
a simple numerical example. The example is from a 
study in a cohort of subjects with chronic renal in- 
sufficiency, a condition in which subjects have di- 
minished kidney function but do not yet require 
dialysis or transplant (Feldman, Appel, Chertow, 
Cifelli, Cizman, Daugirdas, Fink, Franklin-Becker, 
Go, Hamm, He, Hostetter, Hsu, Jamerson, Joffe, 
Kusek, Landis, Lash, Miller, Mohler, Muntner, Ojo, 
Rahman, Townsend and Wright, 2003); although the 
numbers are arbitrary, they are intended to repre- 
sent, in simplified fashion, the problems and associ- 
ations present in studying effects in this population. 
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Here, we wish to study the effect of aggressive treat- 
ment of hypertension on myocardial infarction (MI) 
among subjects who develop ESRD after the start 
of follow-up. We make several simplifying assump- 
tions, which we do not necessarily expect to apply 
in real data: 

1. each of our main study variables is binary and 
scalar; 

2. subjects are randomly assigned to either receive 
or not receive aggressive management of hyper- 
tension; 

3. no subject has ESRD at the start of follow-up; 
and 

4. subsequent ESRD status is recorded before any 
MI occurs. 

We use A to refer to the treatment of interest [A = 
1 (0) indicates the presence (absence) of aggres- 
sive treatment of hypertension] , S to denote a post- 
treatment auxiliary variable [/S = 1 (0) indicates the 
presence (absence) of ESRD] and Y to refer to the 
outcome of interest [y = 1 (0) indicates the occur- 
rence (absence of occurrence) of an MI before the 
end of follow-up] . 

We adopt the potential outcomes approach (Ney- 
man, 1990; Rubin, 1974) to illustrate and define 
our causal estimands of interest. Let denote the 
outcome that would be seen were a subject given 
treatment level a and let S*^ denote the level of the 
auxiliary variable were a subject given treatment a. 
Causal effects are normally defined in terms of com- 
parisons of the outcomes that would be seen in the 
same individuals or groups under different condi- 
tions; for example, as comparisons of V and for 
a' . For this illustration, we assume that aggres- 
sive treatment does not affect MI for any individual 
(i.e., that = Y^ for all subjects); however, aggres- 
sive treatment will prevent ESRD for some subjects 
but never cause it, and so < S^. 

Table 1 classifies the population according to 
whether they would develop ESRD if treated and 



if untreated, and considers the risk of failure in all 
strata based on the cross-classification of this auxil- 
iary variable. Half of the population would not de- 
velop ESRD whether or not they were treated (i.e., 
= = 0); the risk of MI in this group is low 
(10%). In another 30% of the population, aggressive 
treatment would prevent ESRD (i.e., = 0,5" = 
1); the risk of MI in this group is higher (20%). The 
risk is highest (30%) in the 20% of the population 
doomed to get ESRD regardless of treatment (i.e., 
51 = 1,50 = 1). 

In any study, we cannot simultaneously observe 
what would happen to any individual under aggres- 
sive treatment {Y^,S^) and in its absence {Y^,S^), 
and so the joint distribution of the variables repre- 
sented in Table 1 is not estimable. Table 2 shows 
what would be observed in a randomized trial in 
which half of the subjects receive aggressive treat- 
ment and half do not. Because 30% of subjects in 
the cohort would develop ESRD only if not treated 
aggressively (Table 1, row 2), the proportion of sub- 
jects treated aggressively who develop ESRD (20% = 
100/500) is much lower than the proportion of sub- 
jects not treated aggressively who develop ESRD 
(50% = 250/500). Among untreated subjects who 
develop ESRD, 24% develop an MI, whereas 30% of 
treated subjects who develop ESRD also develop an 
MI. Similarly, 10% of untreated subjects who do not 
develop ESRD later develop an MI, whereas 13.75% 
of treated subjects who do not develop ESRD later 
develop an MI. All of this may be derived (in expec- 
tation) from Table 1. 

2.2 A Naive Approach 

To examine the effect of aggressive treatment on 
MI among subjects who develop ESRD, some would 
compare the probability of MI among aggressively 
treated subjects who develop ESRD (0.3) with the 
probability of MI among untreated subjects who de- 
velop ESRD (0.24). A naive interpretation is that 



Table 1 

Probability of MI if treated aggressively or not, by ESRD status if treated aggressively or not 



ESRD status 






Probability of MI 






if untreated {S°) 


if treated (S^) 


priY° = l\S°., 






= i\s°,s'-) 


N 








0.1 






0.1 


500 


1 





0.2 






0.2 


300 


1 


1 


0.3 






0.3 


200 
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Table 2 

Probability of MI m randomized trial, by treatment arm 
and observed ESRD status 



Treatment {A) 


ESRD (5) 


N 


Probability of MI 

(pr{Y = 1\S,A)) 








250 


0.1 




1 


250 


0.24 


1 





400 


0.1375 




1 


100 


0.3 



the difference between these probabihties (0.06) rep- 
resents the effect of treatment for subjects who wih 
develop ESRD. This interpretation is not correct, 
since aggressive treatment actually has no effect on 
MI for any subject. The naive comparison pr{Y = 
l\A = 1,5 = 1) -pr{Y = 1|A = 0,5 = 1) diverges 
from the true individual causal effects because mem- 
bership in the groups being compared depends on 
treatment. Aggressive treatment reduces the num- 
ber of subjects with ESRD. Subjects who would de- 
velop ESRD even if treated aggressively comprise 
the sickest subgroup in the study. A higher propor- 
tion of them than any other subgroup would have 
developed MI whether or not they had received ag- 
gressive treatment, and comparing them with un- 
treated subjects who develop ESRD is comparing 
them to a combination of subjects from the same 
subgroup and subjects who would develop ESRD 
only if not treated aggressively. Thus, condition- 
ing on ESRD, a post-treatment variable, leads to 
inappropriate estimates of overall treatment effect 
(Robins, Blevins, Ritter and Wulfsohn, 1992; Rosen- 
baum, 1984). 

A complementary approach to view the bias re- 
sulting from conditioning on ESRD involves directed 
acyclic graphs (DAGs) (Pearl, 1995, 2000). Figure 1 
shows the relations between the observed variables 
A, S and y, and unobserved variable(s) U . The ar- 
rows from U to both 5 and Y indicate that there 




Fig. 1. A directed acyclic graph representing the relations 
among the variables in the example of Section 2. 



are some unmeasured common causes of both. The 
arrow from j4 to 5 indicates that A influences 5, 
whereas the absence of any directed path (i.e., a se- 
ries of directed arrows) from exposure to outcome 
indicates that A has no effect on y. 5 is known 
as a collider (i.e., there are arrows which converge 
on 5). It is well known that conditioning on col- 
liders induces associations between the parents of 
the collider (i.e., between A and U). Because U in- 
fluences Y , the conditional association between A 
and U given 5 propagates to a conditional associ- 
ation between A and Y given 5. Structurally simi- 
lar problems involving selection bias in epidemiology 
are discussed elsewhere (Greenland, 2003; Hernan, 
Hernandez-Diaz and Robins, 2004). 

We consider briefly an appropriate causal inter- 
pretation of the naive comparison of observable con- 
ditional distributions pr{Y = 1\A= 1,5= \)—pr(Y = 
1|A = 0,5 = 1). Ina randomized trial, the observed 
distribution pr{Y\A = a,S = s) equals the condi- 
tional distribution pr{Y"'\S"' = s) of the potential 
outcome that would be seen among subjects who 
would have a common value of the auxiliary vari- 
able 5 if they receive level a of treatment. Here, 
this is the probability of MI that would be seen 
in subjects developing ESRD were all subjects to 
be treated aggressively (for a = 1) or not receive 
aggressive treatment (for a = 0). A comparison of 
pr(y"|5" = s) for different values of treatment a re- 
flects the impact of treatment on the conditional 
distribution of the potential outcome given the aux- 
iliary outcome. In nonrandomized studies, the ob- 
served conditional distributions pr{Y\A = a, S = s) 
will not in general equal the potential conditional 
distributions pr{Y^\S"' = s). So long as information 
is collected on all subjects in a cohort, this condi- 
tional distribution is identified under the commonly 
used assumption of strongly ignorable treatment as- 
signment (Rosenbaum and Rubin, 1983), 

(1) pr{A = a\S,Y,X)=pr{A = a\X)>0, 

where 5 = {S*"} and Y_ = {Y'^} denote the vectors 
of potential auxiliary and main outcomes, respec- 
tively. Strong ignor ability identifies pr{Y°'\X, S"" = 
s) as pr{Y\X,A = a, 5 = s). The potential condi- 
tional distributions may be useful in decision mak- 
ing (Section 7.1). 

Sometimes, the effects of an intervention on a (con- 
ditional) distribution will be all that is identifiable 
from one's data under plausible assumptions. Here, 
however, there are other measures of effect that more 
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closely relate to the scientific questions of interest, 
regarding the effect that aggressive treatment has 
for subjects who develop ESRD. We consider these 
in the following sections. 

3. DEFINITION OF EFFECTS FOR A 
COMMON SET OF INDIVIDUALS 

We consider several ways of characterizing or defin- 
ing effects for subjects who will develop (or are likely 
to develop) an auxiliary outcome. In each approach, 
we consider first the group or subgroup for whom 
effects are defined; for each approach, the definition 
of effects for this group follows in a straightforward 
way from the potential outcomes approach. We re- 
late the effects to each other through a common 
probability model in Section 3.7. 

3.1 Principal Stratification 

Principal stratification (PS) (Frangakis and Ru- 
bin, 2002) is a method proposed recently by several 
authors for defining certain types of causal effects. In 
this approach, effects are characterized within strata 
defined by the vector of potential auxiliary outcomes 
(here and S^). In our example, we may be inter- 
ested in the effect of aggressive treatment for sub- 
jects who would develop ESRD whether or not they 
are treated aggressively (i.e., = = 1). In par- 
ticular, we concentrate here on comparing the pro- 
portion in this subgroup that would have an MI if 
treated aggressively \pr(Y^ = 1\S^ = = 1)] and 
the proportion in the same subgroup who would 
have an MI if not treated aggressively [pr(Y^ = 1\S^ = 

= 1)]. In general, the approach compares the ex- 
pected value or distribution of potential outcome Y°- 
for different levels of treatment (a = 0, 1) in strata 
defined by the levels the auxiliary variable would 
take under both levels of treatment (i.e., strata are 
defined by and jointly). In our data, this stra- 
tum is shown in the third data row of Table 1; ag- 
gressive treatment has no effect on MI in this group 
(as in all other principal strata). 

In general, the principal strata are not fully iden- 
tified from the data, because one cannot simulta- 
neously observe both potential auxiliary outcomes 
and S^. In the data in Table 1, aggressive treat- 
ment sometimes prevents but never causes ESRD 
(i.e., > S^); this monotonicity (Angrist, Imbens 
and Rubin, 1996) is not completely plausible in the 
nephrology example; Section 8.1 discusses this at 
more length. Under monotonicity, the principal stra- 
tum of some subjects is identifiable: aggressively 



treated subjects who develop ESRD would have done 
so even had they not been treated (i.e., A= 1, S = 
= 1 implies = = 1), and untreated subjects 
who do not develop ESRD would not have done so 
even had they been treated (i.e., ^ = 0, S = = 
implies 51 = 50 = 0). 

3.2 Single Potential Stratification 

One can define the effect of a treatment for a sub- 
group defined by a single potential outcome. For ex- 
ample, one may be interested in the effect of ag- 
gressive treatment for people who would develop 
ESRD if they were treated aggressively [a compari- 
son oipr{Y^ = 1\S^ = 1) and pr(Y^ = 1\S^ = 1); Ta- 
ble 1, row 3], or in the effect of aggressive treatment 
on people who would develop ESRD if they were 
not treated aggressively [a comparison of pr(Y^ = 
1|5° = 1) and pr{Y° = 1|5° = 1); the last two rows 
in Table 1]. Such stratification may be viewed as a 
coarser form of PS. 

Membership in this stratum, defined by a single 
auxiliary variable {S^ or S^), is only partially ob- 
served. Whether a person is in this single auxiliary 
stratum defined by 5" is known if a person receives 
treatment level a. As above, this complicates sta- 
tistical inference for effects defined for such groups. 
One may use approaches such as those used for PS 
to estimate effects defined in this fashion; because 
of the connection with observed auxiliary stratifi- 
cation, one can also use methods described for the 
next sort of stratification we discuss. 

3.3 Observed Auxiliary Stratification 

For this approach, we define effects for groups de- 
fined by observed auxiliary variables. For example, 
we might consider the effect of aggressive treatment 
for subjects who received nonaggressive treatment 
and developed ESRD; that is, we compare pr{Y^ = 
l\S = l,A = 0) with f3r(y° = 1|5 = 1, A = 0). The 
subjects for whom this effect is defined are the 250 
subjects in the second row of Table 2 (who comprise 
the 50% of the subjects in the second and third rows 
of Table 1 who are untreated). Alternatively, we 
might be interested in the effect of aggressive treat- 
ment for subjects who received aggressive treatment 
and developed ESRD; that is, we compare pr(Y^ = 
l\S = 1,^ = 1) with pr(yO = 1|5 = 1,^ = 1) (here, 
the effects are defined for the last row of Table 2, or 
the 50% of the last row of Table 1 who are treated) . 
We have called the latter comparison the realized ef- 
fect of treatment (Joffe, 2001), because it represents 
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the effect treatment actually had within a subgroup 
(which, in this instance, is defined by post-treatment 
variables). 

In a randomized trial, observed auxiliary strat- 
ification is equivalent to single potential auxiliary 
stratification. To see this, note that pr{Y"' =l\S = 
l^A = a) = priV' = 115"^ = l^A = a) = priY"' = 
lis*" = 1); the last step follows because of random- 
ization. Thus, for example, in a randomized trial 
in our ESRD example, aggressively treated subjects 
who develop ESRD are comparable to the set of sub- 
jects who, if they had received aggressive treatment, 
would have developed ESRD. In observational stud- 
ies, under ignorable treatment assignment (Rosen- 
baum and Rubin, 1983), the above holds conditional 
on covariates. 

Here, the subgroup for whom the effect is defined 
is fully observable; however, the effects are defined 
for groups that are not identified at time of treat- 
ment decision. Thus, like PS and single potential 
stratification, this approach cannot be used directly 
to predict, at the time of a treatment decision, the 
effect of that decision. Further, there is an explana- 
tory flavor to the analysis and model: these effects, 
which have already happened to defined subgroups, 
can be used to explain differences between random- 
ized groups. 

3.4 Expected Auxiliary Stratification 

Unlike previous approaches, one can define effects 
in a way that uses information on auxiliary variables 
to define subgroups identifiable at the time treat- 
ment decisions are made. Here, we define effects for a 
group of subjects who are likely to develop ESRD if 
given a particular treatment. Let n"'{X) = E(5"|X) 
denote the probability, given baseline covariates X, 
of developing ESRD if one were to receive treatment 
level a; it is the expected value of the potential aux- 
iliary variable. ^^"'{X) has been called a "principal 
score" (Hill, Waldfogel and Brooks-Gunn, 2002). We 
then define effects as a comparison of potential out- 
comes (MI) for subjects with the same expected aux- 
iliary (ESRD), for example, E{Y^\^i^{X)} - 
E{Y^\^^(X)}. Alternatively, we can define effects 
for broader strata based on ^°(X), for example, 
^{yVn^) >0.8}-S{y°|/in^) >0.8},the effect 
of aggressive treatment for subjects with at least an 
80% chance of developing ESRD if treated aggres- 
sively. 



3.5 Expected Multiple Auxiliary Stratification 

The approach of the last section may be extended 
to condition effects on multiple expected auxiliary 
variables. Thus, one might be interested in the effect 
of treatment for subjects with an 80% risk of devel- 
oping ESRD if treated aggressively and a 90% risk if 
not treated aggressively; that is, we derive effects for 
subjects based in groups determined by both ijP{X) 
and /i^(X). This approach has the flavor of PS; un- 
like PS, the subgroups for whom effects are defined 
are fully identified in the data, based solely on pre- 
treatment information. As in the previous section, 
conventional statistical methods apply. 

3.6 Conventional Approach 

A final alternative is to estimate effects in sub- 
groups based solely on pretreatment covariates, where 
group membership is not dependent on any risk score 
like n'^{X). Conventional estimation approaches may 
apply; we may look for effect modification by base- 
line covariates directly, rather than indirectly through 
the expected auxiliary. 

3.7 Probability Models for the Data 

Following Rubin (1978), we consider a formal prob- 
ability model for the joint distribution of the observ- 
able data and potential outcomes. We then consider 
approaches to parametrizing parts of this distribu- 
tion (Section 4). 

One general way to factor the joint density of the 
observable quantities and potential outcomes is 

f{X,S,Y,A) 

(2) 

= f{X)fiS\X)fiY\X,S)fiA\X,S,Y). 

This factorization is akin to selection models in com- 
mon use in longitudinal data analysis (Little, 1995). 
Strongly ignorable treatment assignment (Rosenbaum 
and Rubin, 1983) is often assumed. 

PS estimands depend only on one part of the joint 
density: fiY_\X,S_). Other causal estimands also in- 
volve the density of the auxiliary outcome (and some- 
times of the exposure). Single potential stratification 
estimands involve fiYjXjS""). These estimands can 
be obtained by integrating out the other potential 
auxiliary outcomes; that is, f{Y_\X, S"") = Jg^a f{Y_\X, 
S)f{S\X) ds^V/(5"|X), where S^" refers to the vec- 
tor of unobserved potential auxiliary outcomes. Ob- 
served auxiliary estimands involve f{Y_\X,S,A) = 
f{Y,A\X,S)/f{A\X,S) = Js^af{Y\X,S)f{S\X) ■ 
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f{A\X,S,Y)dSryJs^aJyf{Y\X,S)f{S\X)f{A\X, 
S., Y_) dSr"' dY_; under ignorability, this simphfies to 
f{Y\X,S,A = a) = Js^af{Y\X,S)f{S\X)ds-y 
/(S'^jX), which is the estimand of single potential 
stratification. Expected auxiliary stratification in- 
volves integrating out all the auxiliary potential out- 
comes; that is, 

f{Y\E{S'^\X) = s} 

_Ix:EiS'^\x)=sIsfiy\X,S)f{S\X)f{X)dSdX 

Ix:E{S^\X)=sf(^)dX 
_ Ix:EiS^\X)=sfmX)f{X)dX ^ 
Ix:E{S''\X)=s fi^) dX 

in this formulation, one can ignore the density of 
the auxiliary outcome except as it relates to identi- 
fying the subset over whom to average. Similarly, in 
stratifying on observed variables, one can integrate 
out and ignore the auxiliary outcome: f{Y\X) = 
hf{Y_\X,S)f{S\X) dX. Typically, inference concen- 
trates on comparisons of the marginal densities 
f(Y"'\X,-) of the potential outcomes for different 
treatment levels a, rather than the joint density 
•) of the potential outcomes Y_. 

4. CONTINUOUS OUTCOMES: MODELS 
AND ESTIMATION 

This section considers statistical inference about 
the various estimands outlined in the previous sec- 
tion. Some estimation methods in this setting are 
more straightforward with continuous outcome data, 
and so we discuss models and estimation methods 
more fully for such data. We apply these methods 
both to simulated data (Section 5) and the HIP data 
(Section 6). 

We now discuss estimation of the various causal 
quantities defined in the previous sections, concen- 
trating on identification of causal contrasts. We re- 
verse the order of discussion, beginning with strat- 
ification on observed variables and ending with PS, 
as the nature of the latent structure and identify- 
ing assumptions becomes increasingly complex. The 
greater the degree of latent structure, the more as- 
sumptions are needed to estimate parameters in the 
model. The greater degree of assumptions involved 
is justified if they lead to better decision making, 
more explanatory power or greater generalizability, 
issues we take up in Section 7. 



4.1 Conventional Approaches 

For the conventional approach, identification of 
the marginal distributions f(Y°'\X) may be based 
on the assumption of ignorable treatment assign- 
ment (Rosenbaum and Rubin, 1983), pr[A = a\ 
X,Y_) =pr{A = a\X), with ^<pr{A = a\X) < 1 for 
all X,a. Under this assumption, the observed den- 
sity f{Y\X,A = a) equals the density of the poten- 
tial outcome /(y°|X). Thus, one can use standard 
approaches (e.g., regression of y on X and A) to 
estimate the effect of ^ on y. 

4.2 Stratification on Expected Auxiliaries 

For stratification on expected auxiliary variables 
(Hill, Waldfogel and Brooks-Gunn, 2002), identifi- 
cation and estimation are somewhat more compli- 
cated. The simplest method of estimation involves 
two steps: estimating the expected auxiliary /i"(X), 
and estimation of effects by level of this expected 
auxiliary. One can estimate the expected auxiliary 
under ignorable treatment assignment [here 
f{A\X,S) = f{A\X)] (Rosenbaum and Rubin, 1983) 
using standard methods. For example, one can regress 
5 on X for subjects with A = a, then compute the 
expected value as fi"'{X) based on the estimated re- 
gression coefficients. Alternatively, one can regress S 
on X and A, then estimate jj-°'{X) as E{S\X^A = a), 
using, for each subject, his or her observed X and 
the desired treatment level a for all subjects; here, 
one may choose to include interactions between A 
and X as appropriate. 

Under ignorable treatment assignment for the out- 
come y, one can again use standard methods to es- 
timate the effect of treatment for a group classified 
by jl'^{X). For example, one can fit a regression 

E{Y\fi'^{X),A} 

(3) 

= /3o + fi''{X)(if, + AI3a + fi''{X)A(if,A-: 

here, the effect of treatment E{y^|/i"(X)} - E{Y^\ 
fi"'{X)} for subjects with expected auxiliary jl°'{X) 
is fiA + fi'"'iX)Pfia- These plug-in type estimates are, 
in general, consistent for the true parameters Pa 
and P^a in the corresponding regression on the true 
scores ;u"(X) but will typically be biased in small 
samples; this bias results from the fact that the es- 
timate /i" is a mismeasured version on the true ex- 
pected auxiliary /x". It is of some interest to develop 
unbiased estimators of these effects. Although we 
are unaware of any such work in this setting, such 
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work has been done with other generated regressors 
(Pagan, 1984). 

It is tempting to extrapolate the regression effect 
and interpret Pa + Pfia as the effect for a subgroup 
in which all subjects will develop ESRD if treated 
aggressively, or as the effect for the group of sub- 
jects who will develop ESRD if treated aggressively. 
These interpretations are flawed. For the former, 
there may be no subgroups identifiable on the basis 
of pretreatment covariates in which all subjects will 
develop ESRD, and so the parameter is not mean- 
ingful. For the latter, additional assumptions may 
be required for this interpretation to hold, as has 
been discussed in a related context (Joffe, Ten Have 
and Brensinger, 2003). 

4.3 Stratification on Observed Auxiliaries 

We are unaware of previous work on estimation of 
parameters in models stratifying on observed auxil- 
iary variables. Whereas the previous estimands al- 
lowed nonparametric identification under ignorabil- 
ity assumptions, estimation of these parameters will 
require additional assumptions. We sketch one ap- 
proach to estimation below; because of the relation 
between observed auxiliary stratification and single 
potential stratification and PS (Section 3.3), esti- 
mation may also be based on estimating principal- 
strata specific effects as described in the next subsec- 
tion, then marginalizing over the unobserved auxil- 
iary outcomes 5""^. 

We consider estimation based on the following 
idea, similar to G-estimation in structural nested 
models (Robins, 1992; Robins et al., 1992; Robins, 
1994). Estimation will require ignorability assump- 
tions, as above. Suppose our outcome Y is continu- 
ous; Y could be the logarithm of a failure-time (e.g., 
time of breast cancer mortality). We first propose a 
model for the effect of treatment, for example, 

E{Y^\X,A,S) 

(4) 

= E{Y\X,A,S)-A{l-S)^o-AS^i. 

Here, the realized effect of treatment for the sub- 
group of subjects who received treatment and de- 
veloped ESRD is ^'i, and the realized effect for the 
subgroup which received treatment but did not de- 
velop ESRD is We have assumed that these ef- 
fects do not vary with covariates X. 

Under ignorability, Y^ is independent of A given 
X. Let yO(*) =Y -A{1- 5)*o - AS^r, y°(*) 
may be computed from observed quantities and pu- 
tative values for the causal parameters Based 



on the nonidentifiable assumption that treatment 
effects are the same for all subjects with common 
values of A and S, Y^{'^) may be viewed heuristi- 
cally as the potential outcome Y^ if causal theories 
represented by {*0)*i} are true. If the putative 
value of the causal parameter ^ is true, Y^{'^) will 
be independent of A given X . Estimation may be 
based on testing this independence for an assumed 
value of the causal parameter ^. Because ^ is a 
vector of dimension 2, estimation using scalar es- 
timating equations will require either restriction of 
the unknown parameter ^ or a vector of estimating 
equations of the same dimension as ^. 

In some cases, one might assume that either ^'o 
or ^1 is 0. In the HIP study, it might be reason- 
able to assume that screening affects breast cancer 
mortality only for screened subjects diagnosed with 
breast cancer, or, further, only for screened subjects 
whose cancer was detected due to the screen (i.e., 
for subjects for whom S* = 1, ^4 = 1), and so ^'o = 0. 
For binary A, 

(5) Y.(^-p)g{Y\^),X} = 

i 

provides valid estimating equations under the model 
assumptions and ignorability, where p = pr{A = 1\X) 
and g{-) is a known function of its arguments. The 
optimal function g{-) is a sometimes complex func- 
tion of the joint density of the observables and po- 
tential outcomes (Joffe and Brensinger, 2003; Robins, 
1992; Robins et al., 1992). Efficiency, but not con- 
sistency, depends on choosing this optimal function. 
Suppose that the "error" terms e = Y^ — E(Y^\X) 
are normal, independent, identically distributed ran- 
dom variables and that is unrelated to Y^ [i.e., 
f{Y°\X,A,S^) = fiY^\X)]. The optimal function is 
then g{Y°{^),X} = e{^)E{S\X,A = 1), where 
e(*) = yO(*) - E{Y^{^)\X} (Joffe and Brensinger, 
2003); the use of e(^) is similar to Rosenbaum's 
(2002) use of such residuals in randomization-based 
inference. The estimated probability of treatment 
may be substituted for the typically unknown true 
p. The asymptotic variance of the resulting estima- 
tor may be derived using a sandwich-type formula 
(Robins, 1992; Robins et al., 1992). 

If one is unable to restrict the parameter ^ based 
on subject-matter considerations, one must use a 
vector of estimating equations. Here, we will require 
vector functions g{Y{'*if),X}. Under the above nor- 
mality and homoscedasticity assumptions and under 
the assumption that f{Y^\X,A,S^) = f(Y^\X), the 
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optimal function is the vector function g{Y^{^), X} = 
E{^){l-EiS\X,A = l),E{S\X,A = l)}''.g{Y%^), 
X} must have the same rank as the dimension of 
thus, if the covariate does not predict the auxihary 
outcome among the treated, there will be no ability 
to estimate the vector Under our assumptions 
of equal effect across strata of X, the effect of A 
on Y in any stratum of X is ^o{l — E{S\X,A = 
1)} + ^iE{S\X, A = l), each term corresponding to 
part of the vector function g{Y^{'^),X}. 

The approach taken above is semiparametric; that 
is, consistency of the estimators of the parameters of 
interest ^ does not depend on parametric assump- 
tions about the distribution of e or the association of 
X or S with Y^. The methods presented above are 
valid for structural distribution models, which map 
percentiles in the distribution of Y^ to percentiles in 
the distribution of Y^ in defined subgroups. In con- 
trast, the weaker structural mean models only map 
the mean of Y^ to the mean of Y^ [as in the for- 
mulation in (4)]. Consequently, there are fewer valid 
choices of the function g{-) for use in estimation [i.e., 
the function g{-) must be linear in y'^(^')] (Robins, 
Rotnitzky and Scharfstein, 2000). 

With binary outcomes, mean models are required. 
Structural mean models using the identity link do 
not restrict subgroup-specific means to the permissi- 
ble range. With the usual logit link, semiparametric 
models may be formulated, but consistent semipara- 
metric estimators are not generally available (Robins, 
Rotnitzky and Scharfstein, 2000; Robins and Rot- 
nitzky, 2004). Extensions to binary outcomes whose 
consistency depends on parametric assumptions are 
available (Robins and Rotnitzky, 2004; Vansteelandt 
and Goetghebeur, 2003). 

4.4 Principal Stratification 

Because the principal strata of many subjects are 
not determined by the usual combination of data 
and assumptions, inference for PS estimands will be 
more dependent on assumptions and complicated. 
The observed density of the main and auxiliary out- 
comes may be expressed in terms of the unobserved 
potential auxiliary outcomes S_'^ as follows: 

f{Y,S\A = a,X) 

= fiY\S,A = a,X)fiS\A = a,X) 

(6) ■ f{S'',Sr'' = sr''\A = a,X) 



= J2f{Y\S,Sr'' = s-'',A = a,X) 

■ fiSr"" =r''\S,A = a,X) 

■fiS\A = a,X). 

Under ignorability, the components of the right-hand 
side of (6) are equivalent to models for the potential 
main and auxiliary potential outcomes, f{Y\S,S_~^'^, 
A = a,X) = /(y«|5,X) and /(5,5^"|A = a,X) = 
f{S_\X). Thus, parametrizing the models for the ob- 
servables can lead to a likelihood for the causal quan- 
tities of interest. 

This likelihood is generally overparametrized. Sup- 
pose that both S and A are scalar and binary. Then, 
there are six unknown observed densities for each 
level of covariates X: two for f{S\A = a,X) (one 
for each level of A), and four for f{Y\S, A = a,X) 
(one for each level of A and S). However, there are 
eleven densities for the causal parameters in (6): 
three for f{S_\X) (there are four levels of «S, but the 
probabilities sum to 1), and eight for /(y"|X,5) 
(two levels of a x two levels of 5 x two levels of 
S_~^^). Thus, identification of causal effects will re- 
quire further restrictions on the parameters. Before 
proceeding to discuss such restrictions, we note that 
there are (at least) two approaches which do not re- 
quire identification: a Bayesian approach, in which 
prior information is combined with the likelihood 
to produce a posterior (Imbens and Rubin, 1997), 
and approaches which derive bounds on causal ef- 
fects (Balke and Pearl, 1997; Manski, 1990; Robins, 
1989; Rubin, 2004; Zhang and Rubin, 2003; Cheng 
and Small, 2006). 

There are several forms of restrictions that can 
be applied. These include restrictions on the joint 
distribution of the potential auxiliary outcomes S 
(i.e., on the principal strata), and restrictions on 
the marginal distributions of the potential main out- 
comes 

One type of restriction on the auxiliary outcome 
occurs when some level of the auxiliary outcome is 
impossible under some treatment level a. Consider 
first randomized trials with noncompliance, one area 
in which the ideas of PS have been applied fre- 
quently (Angrist, Imbens and Rubin, 1996; Baker 
and Lindeman, 1994). In one view of these studies 
(Imbens and Rubin, 1997), one may view random- 
ization, the controlled factor, as the treatment A 
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whose effect one is interested in estimating, and the 
level of exposure to the experimental therapy S as 
the auxiliary variable. One may be interested in the 
effects of randomization for different classes of sub- 
jects defined by the set of behaviors S_ = {S^,S^} 
they would follow under different treatment assign- 
ments. In some randomized trials, subjects assigned 
the control treatment {A = 0) have no access to 
the experimental therapy {S = 1); this may be com- 
mon with investigational therapies not available out- 
side of the trial. In this case, there are two rather 
than four principal strata S_ : {S^, S^} = (0, 1) (com- 
pilers) and {5'*^,S'^} = (0,0) (never-takers) . In the 
HIP Study, it may be reasonable (under the con- 
ditions of the study in the 1960s) to reason that 
subjects not assigned to screening {A = 0) could 
not have received mammograms and so could not 
have been diagnosed as a result of mammography 
{S = 1). In these cases, the principal stratum of 
the treated subjects is known; since 5*^ = 0, know- 
ing the observed outcome S = in a treated sub- 
ject fully identifies S_. Further, under ignorability, 
the proportion of subjects in each principal stratum 
pr(S_\X) is identified. Under ignorability, the densi- 
ties /(yi \X, S^,S^ = s) are identified as f{Y\X, S = 
s , A = 1 ) , but the densities f {Y^\X,S, A = 0) are not 
identified, and so the stratum-specific causal effects 
are not identified without further assumptions. 

A more general form of restriction on the potential 
auxiliary outcomes consists of monotonicity restric- 
tions, which have been adduced in a variety of set- 
tings (Efron and Feldman, 1991; Gilbert, Bosch and 
Hudgens, 2003; Angrist, Imbens and Rubin, 1996). 
For any two ordered levels of treatment a, a', a' > 
a, strict monotonicity states that 5"* > for all 
subjects (or, alternatively, that 5" < 5"). Mono- 
tonicity identifies the principal stratum of some sub- 
jects. For binary S, monotonicity and ignorability 
together permit identification of the proportion of 
subjects in each principal stratum. In the HIP study, 
if 5 is cancer diagnosis, one might assume that, if 
a cancer were detected in an unscreened subject, 
it also would have been detected had the subject 
been screened (i.e., > S^). This assumption may 
be incorrect; for example, a woman who has been 
screened and who was told that she had no cancer 
may become less suspicious of cancer later, and so 
screening might sometimes lead to a missed diagno- 
sis of cancer. 

None of these monotonicity assumptions, even in 
conjunction with ignorability, is sufficient to identify 



stratum-specific treatment effects. Such identifica- 
tion typically requires assumptions about the out- 
come distributions /(y"|X,S^). We outline several 
such assumptions. 

One strict set of assumptions is that treatment has 
no effect on the outcome Y for some subsets of the 
data defined by S_. In randomized trials, it is com- 
mon (Mark and Robins, 1993a; Sommer and Zeger, 
1991) to assume that randomization has no effect 
for those subjects for whom treatment has no effect 
on 5 [i.e., = Y"' if 5" = S"', or f{Y''\X,S) = 
f{Y°- \X,S_) if S*" = 5° , a weaker assumption]; this 
assumption is known as an exclusion restriction in 
the econometrics literature (Angrist, Imbens and 
Rubin, 1996). In studies of breast cancer screening 
(e.g., HIP), one might assume that screening has 
no effect among subjects whose tumors were not 
detected through screening (i.e., for subjects with 
5^ = 0, S**^ = where 5 = 1 if a subject has a screen- 
detected tumor, otherwise). In both of these ex- 
amples, these restrictions, together with the mono- 
tonicity restrictions and ignorability, are sufficient to 
identify the causal effects [and, in fact, the marginal 
densities f{Y"'\X,S_)] in the single principal stratum 
in which treatment effects are not assumed to be 
0. To see this, note that the marginal density of 
the outcome may be written /(y"|X) = J2sP''"{S.= 
s)f{Y'^\X,S_ = s); under our assumptions, pr{S_ = s) 
is known for all s, and f{Y'^\X,S_ = s) is identified as 
either f{Y''\X,S,A = a) or f{Y''\X,S,A = l- a) in 
the principal strata in which treatment has no effect. 
Further, the marginal densities f{Y°'\X) are identi- 
fied from the data under ignorability. This leaves one 
unknown quantity f{Y"'\X,S_ = s) for the principal 
stratum in which the effect is not assumed to be 
zero; we then solve /(y"|X) = EsPr{S = s)f{Y^\ 
X,S_ = s) to find this unknown density. 

When these assumptions are not reasonable, the 
principal stratum-specific effects are not identified 
without other assumptions. These assumptions can 
take several forms: assumptions about how differ- 
ent principal strata are from each other with re- 
spect to the potential outcomes, assumptions about 
the distribution of potential outcomes within princi- 
pal strata, and assumptions about the associations 
of covariates X with the potential outcomes within 
principal strata. 

Assumptions about the differences between princi- 
pal strata may involve specifying E{Y^\X,S_ = s) — 
E(Y^\X,S_ = s') for some s,s',s^s'; such assump- 
tions are also sometimes framed in terms of the de- 
gree of dependence of S^~'^ on Y^ in models for 
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pr{S^~'^\X,S,A,Y'^). Such assumptions allow one 
to avoid parametric assumptions about error distri- 
butions. It is fairly difficult to have precise quantita- 
tive knowledge about the degree of this dependency. 
One way to deal with such uncertainty is to perform 
a sensitivity analysis, in which one varies the sensi- 
tivity parameter(s) over a plausible range (Gilbert, 
Bosch and Hudgens, 2003); another way is to put 
a prior distribution on the dependency parameter, 
as done in various Bayesian analyses (Imbens and 
Rubin, 1997). 

Assumptions about the distribution of outcomes 
within a principal stratum are sometimes made 
in both Bayesian and frequentist inference. Most of- 
ten, /(y^|5) is assumed to be normal with unknown 
mean and variance (Imbens and Rubin, 1997); we 
will consider likelihood-based inference under this 
assumption. With these assumptions, identification 
results from the identifiability of mixtures of cer- 
tain parametric families of distributions (Teicher, 
1963). For example, consider a model with two prin- 
cipal strata, never-takers and compilers, for which 
no exclusion restriction is assumed but the densi- 
ties f(Y"'\S_) are assumed to be normal. Identifia- 
bility of /{y°|5 = (0,1)} and /{yO|5 = (0, 0)} re- 
sults from the fact that f{Y\A = 0) is a mixture of 
its two component normal distributions, and the pa- 
rameters of a mixture of two normals are identifiable 
(Teicher, 1963). In order to identify which mixture 
component represents f{Y^\S_ = (0,1)} and which 
represents f{Y^\S_= (0,0)}, it is required that the 
proportion of compilers not equal 0.5. Identification 
based on parametric assumptions about the distri- 
butions of outcomes f{Y^\S_) may not be robust to 
changes in parametric assumptions, as with selec- 
tion models (Copas and Li, 1997). 

Finally, one can make assumptions about the asso- 
ciations of covariates X with the potential outcomes 
within principal strata. For example, one might as- 
sume that principal stratum-specific effects are con- 
stant across covariates X [e.g., E{Y^ —Y^\X = x,S_) - 
E{Y^ — Y^\X = x' ,S_), for all x,x']; this assumption 
parallels assumptions we made for structural nested 
models that effects conditional on observed auxil- 
iaries do not vary with X. One might also assume 
that the association of the covariates with poten- 
tial outcomes has a particular parametric form [e.g., 
E{Y^\X,S) = q{S)+XP]. 

5. A SMALL SIMULATION EXPERIMENT 

In this section, we report results of a small simula- 
tion study to examine the performance of certain PS 



and observed auxiliary stratification estimators. We 
consider a continuous outcome Y, one binary covari- 
ate X, and four settings (two sets of expected val- 
ues for outcomes within the A,X,S_ strata and two 
distributions of outcomes within the A,X,S_ strata 
for each set of expected values). For the four set- 
tings, the expected values of Y within the A,X,S_ 
strata are shown in Table 3 along with the para- 
metric distribution within the A,X,S_ strata. We 
label the principal strata 5 by {S^ = 0, 5^ = 0) = 
immune (I), (5" = 1, S*^ = 0) = treatment protective 
(TP), (S° = 0, 51 = 1) = treatment harmful (TH) and 
{S^ = 1, 5^ = 1) = doomed (D) (Greenland and Robins, 
1986). For all settings, the standard deviation of 
y within each A,X,S. stratum is 1; E{Y\A,X = 
1, S) - E{Y\A,X = 0,S)= 0.5 for all A,S, so that 
the principal stratum-specific effects are constant 
across the covariate X; pr{S_ = I) = 0.25, pr{S_ = 
TP) = 0.4, pr{S = TH) = 0.05 and pr{S = D) = 0.3; 
andpr(X = 115 = 1) = 0.5, i9r(X = 115 = TP) = 0.75, 
pr{X = l\S = TH) = 0.25 and pr{X = l\S = D) = 
0.5. The sample size is 5000 for each setting and the 
probability of being randomly assigned treatment 
{A=l) is 0.5. 

We consider two PS estimators and one observed 
auxiliary stratification estimator. The PS estimators 
assume normal outcomes within each A,X,S_ stra- 
tum. This assumption is satisfied for settings 1(A) 
and 11(A) and violated for settings 1(B) and 11(B). 
The PS estimators furthermore make the correct as- 
sumption E{Y\A,X = I, S)-E{Y\A,X = 0,S) 
is the same for all A,S_. The first PS estimator we 
consider does not put any constraints on E{Y\ 
A,X^S). The second estimator constrains the av- 
erage effect of treatment in the immune principal 
stratum given a value of X[£'(y|^ = 1,X,5 = I) — 
E{Y\A = 0, X, 5 = I)] to be equal to the average ef- 
fect of treatment in the treatment protective prin- 
cipal stratum given the same value of X and con- 
strains the average effect of treatment in the treat- 
ment harmful principal stratum given a value of X 
to be equal to the average effect of treatment in 
the doomed principal stratum given the same value 
of X. The constraint made by the second PS es- 
timator is correct for settings 1(A) and 1(B) but 
is incorrect for settings 11(A) and 11(B). We used 
the EM algorithm to implement the PS estimators 
and used the true parameters as the starting values. 
The observed auxiliary stratification estimator is 
based on model (4) and uses the function 
<7{y°(*),X} = e{^){l-E{S\X,A = l),E{S\X,A = 
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1)} discussed in Section 4.3. Model (4) is correct 
for settings 1(A) and 1(B) but is incorrect for set- 
tings 11(A) and 11(B); in settings 11(A) and 11(B), 
E{Y\X,A,S) - E{Y^\X,A,S) varies with X. For 
the observed auxihary stratification estimator, we 
used the computational procedure for weighted 
G-estimation described in Ten Have, Elliott, Joffe, 
Zanutto and Datto (2004) using the true values as 
the starting values. R code for the simulations and 
analysis and an example dataset are available at 
www. cceb. upenn. edu/ faculty/ ?id= 157 . 

Table 4 displays the means and standard devia- 
tions of the estimators for 500 simulations of each 
setting. In setting 1(A), the assumptions made by all 
three estimators are correct. All the estimators pro- 
vide approximately unbiased estimates of their cor- 
responding estimands. The PS II estimator that con- 
strains certain average effects of treatment within 
principal stratum to be equal is the most efficient 
estimator. The PS II estimator and the observed 
auxiliary stratification estimator make similar as- 
sumptions, but the additional parametric assump- 
tions made by the PS II estimator make it consid- 
erably more efficient. In setting 1(B), the assump- 
tions made by the observed auxiliary stratification 
estimator continue to hold but the parametric as- 
sumptions made by the PS estimators are false. The 
observed auxiliary stratification estimator performs 
similarly in setting 1(B) as it did in setting 1(A) — it 
is approximately unbiased and has a similar vari- 
ance. In contrast, the PS estimators exhibit consid- 
erable bias for some of their estimands in setting 
1(B). The PS I estimator has a particularly large 
bias for the average effects of treatment in the im- 
mune and doomed principal strata. In settings 11(A) 
and 11(B), the assumption made by the observed 



auxiliary stratification estimator that treatment ef- 
fects conditional on observed auxiliaries do not vary 
with the covariate X is false. In the "Truth" col- 
umn for and ^i, we list the average realized 
effects of treatment for subjects with S*^ = and 
S"""^ = 1, respectively. The observed auxiliary stratifi- 
cation estimator of for settings 11(A) and 11(B) 
does not show much bias but the estimator of ^'i 
shows considerable bias. For the PS estimators, the 
assumptions made by PS estimator I are true in set- 
ting 11(A) and false in setting 11(B), and the as- 
sumptions made by PS estimator II are false in both 
settings. As in settings 1(A) and 1(B), PS estimator 
I is approximately unbiased when its parametric as- 
sumptions hold [setting 11(A)] but is considerably 
biased for some estimands when its parametric as- 
sumptions do not hold [setting 11(B)]. For setting 
11(A), PS estimator II provides slightly biased but 
low variance estimates of the average effects of treat- 
ment across the immune and treatment protective 
strata and the average effect of treatment across 
the treatment harmful and doomed strata (the value 
listed in the Truth column for the PS II estimator 
for immune=treatment protective is the average ef- 
fect of treatment among subjects with = and 
for treatment harmful=doomed is the average effect 
of treatment among subjects with = 1). For set- 
ting 11(B), PS estimator II's estimate of the average 
effect of treatment across the treatment harmful and 
doomed strata is substantially biased. 

In summary, all of the estimators are somewhat 
sensitive to their assumptions. The PS I estima- 
tor makes no assumption about how the average ef- 
fects within a principal stratum [i?(y |A = 1,X,S_)~ 
E{Y\A = 0, X,S_)] compare between principal strata, 
but is sensitive to its parametric assumptions. The 
PS II estimator and observed auxiliary stratification 



Table 3 

Parameter values for simulation study 



Setting 


1(A) 


1(B) 


11(A) 


11(B) 


E{Y\A =1,X = 0,S_= immune) 


2 


2 


2 


2 


E{Y\A = 0, X = 0, 5 = immune) 


1 


1 


1 


1 


E(Y\A = 1,X — 0,S_ = treatment protective) 


2.5 


2.5 


2.5 


2.5 


E(Y\A = 0,X — 0,S_ = treatment protective) 


1.5 


1.5 


1 


1 


E{Y\A =1,X — 0,S_= treatment harmful) 


1.25 


1.25 


1.25 


1.25 


E{Y\A = 0,X = 0,S_= treatment harmful) 


0.75 


0.75 


0.75 


0.75 


E{Y\A =1,X = 0,S = doomed) 


1.75 


1.75 


2.25 


2.25 


E{Y\A = 0,X^O,S = doomed) 


1.25 


1.25 


1.25 


1.25 


Distribution witliin eadi A,X, , stratum 


Normal 


Gamma 


Normal 


Gamma 
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Table 4 
Simulation study results 

Settings 1(A) and 1(B) 
Setting 1(A) Setting 1(B) 



Estimator Truth Mean SD Truth Mean SD 

Principal stratification estimator I average effects of treatment 

Immune 1 0.97 0.23 1 0.11 0.14 

Treatment protective 1 1.00 0.13 1 0.82 0.05 

Treatment harmful 0.5 0.54 0.48 0.5 0.68 0.06 

Doomed 0.5 0.51 0.19 0.5 1.36 0.10 

Principal stratification estimator II average effects of treatment 

Immune=treatment protective 1 1.00 0.03 1 0.78 0.14 

Treatment harmful=doomed 0.5 0.50 0.04 0.5 0.60 0.08 

Observed auxiliary stratification 

*o 1 1.00 0.12 1 1.00 0.12 

*i 0.5 0.50 0.21 0.5 0.50 0.21 



Settings 11(A) and 11(B) 



Estimator 


Setting 11(A) 




Setting 11(B) 




Truth Mean SD 


Truth 


Mean 


SD 




Principal stratification estimator I average effects of treatment 






Immune 


1 0.98 0.22 


1 


0.00 


0.14 


Treatment protective 


1.5 1.52 0.14 


1.5 


1.91 


0.06 


Treatment harmful 


0.5 0.60 0.54 


0.5 


1.16 


0.08 


Doomed 


1 0.96 0.21 


1 


0.43 


0.08 




Principal stratification estimator II average effects of treatment 






Immune=treatment protective 


1.42 1.33 0.03 


1.42 


1.63 


0.04 


Treatment harmful=doomed 


0.93 0.88 0.04 


0.93 


-0.01 


0.07 




Observed auxiliary stratification 








*o 


1.42 1.51 0.12 


1.42 


1.52 


0.13 




0.93 0.56 0.23 


0.93 


0.53 


0.23 



estimator both put the same constraints on how the 
average effects within principal strata compare, and 
both estimators are sensitive to these constraints 
holding. The PS II estimator was considerably more 
efficient than the observed auxiliary stratification es- 
timator when its parametric assumptions held, but 
showed considerable bias for some estimands when 
its parametric assumptions did not hold. We also 
performed the same set of simulations for a sam- 
ple size of 500; the results are not shown but are 
available from the authors. The performance of the 
estimators and the way in which the estimators com- 
pare were similar to what is shown in Table 3 for 
the larger sample size of 5000. Some notable fea- 
tures of the simulations for the smaller sample size 



of 500 were: (1) in a very small proportion of simu- 
lations (approximately 0.005%), the observed auxil- 
iary stratification estimator produced estimates that 
were very large in absolute value (more than 100 
times the true value); (2) in a very small propor- 
tion of simulations (approximately 0.1%), the PS I 
estimator did not converge; and (3) the PS estima- 
tors exhibited a small amount of bias (a maximum 
of about 0.1) for some estimands even when their 
parametric assumptions held. 

For the settings considered in Table 3 (with a sam- 
ple size of 5000), we also computed the expected 
auxiliary stratification estimator based on model (3). 
The means of the estimates of (3a + (3^a were —3.00, 
-2.99, -2.18 and -2.19 for settings 1(A), 1(B), 11(A) 
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and n(B), respectively. These mean estimates are 
not close to the effect of treatment for the sub- 
group of subjects with = 1. These results illus- 
trate the point mentioned at the end of Section 4.2 
that the expected auxiliary stratification estimates 
cannot necessarily be extrapolated to reliably esti- 
mate the effect of treatment for the subgroup of sub- 
jects who will have a certain value of the auxiliary 
variable if treated. 

6. ANALYSIS OF THE HIP DATA 

The Health Insurance Plan (HIP) Study (Shapiro 
et al., 1988) was a randomized trial of screening for 
breast cancer, in which more than 60,000 women 
aged 40-64 at the start were randomized into two 
groups. In the treatment arm, women were offered 
screening examinations, consisting of mammogra- 
phy and clinical exam, to be provided in an initial 
visit and three annual follow-up visits. About one- 
third of women in the treatment arm refused screen- 
ing; some of the others did not receive some of the 
follow-up exams; this information is also recorded. 
Women in the control group received usual care. 
The study recorded information on date of death 
for women who died; cause of death was classified 
as being due to breast cancer or not. The study also 
recorded information on whether and when a par- 
ticipant was diagnosed with breast cancer, and if 
so, whether the cancer was detected by one of the 
screening exams in the study. 

We analyzed the HIP data using G-estimation of 
structural models. For these analyses, our outcome 
Y is the natural logarithm of our failure-time, death 
from breast cancer. Because most subjects in the 
study do not experience the outcome of interest be- 
fore the end of follow-up, the outcome is said to be 
censored. 

We consider several choices of the auxiliary out- 
come S. For the first, S = 1 if a subject receives a 
screen, otherwise; for the second, 5 = 1 if a subject 
is screened and diagnosed with breast cancer, oth- 
erwise; for the third, 5" = 1 if a subject is diagnosed 
with breast cancer and that cancer was detected by 
the screen, otherwise. Here, the treatment of in- 
terest A is (randomized) assignment to the screen- 
ing arm. We assume that assignment to the screen- 
ing arm has no effect on the outcome unless a sub- 
ject actually is screened and is diagnosed (for the 
third choice, diagnosed based on the screen), and so 
^0 = 0. Because of censoring, we use as our function 



5{y°(*),X} = A(*), where A(*) = I{Y{^) < C(*)}, 
C(^') = min{C, C exp(^')} (for binary A, S, and scalar 

and C denotes a subject's potential censoring 
time (here ten years). Other references (Joffe, 2001; 
Mark and Robins, 1993b; Robins, 1992) discuss the 
more general approach to dealing with administra- 
tive or generalized type I censoring. Joffe (1994) pro- 
vides fuller justification for use of this particular 
function. To deal with competing risks, we weight 
the estimating functions for subjects not censored 
by competing risks by the inverse of their probabil- 
ity of being uncensored at the end of their follow- 
up; Robins et al. (1992) provide more details. We 
consider subjects who died of other causes as cen- 
sored by a competing risk, although this is some- 
what problematic (Kalbfleisch and Prentice, 2002). 

For our analysis, we consider the first ten years 
of follow-up for each subject. In this period, 1259 
subjects were diagnosed with breast cancer (626 of 
these were in the screening arm). There were 340 
deaths attributed to breast cancer: 193 in the control 
arm and 147 in the screening arm. Among the 441 
cancers diagnosed among subjects who received at 
least one screen, 132 cancers were detected by the 
screen; among the 100 screened breast cancer cases 
who died of breast cancer, 27 had been detected by 
the screen. 

We applied the modified G-estimation approach 
to these data. Figure 2 plots the Z-test statistics 
for each choice of S against ^i. The test statis- 
tics for *i = are identical [Z = -2.39, p = 0.017 
(2-sided)]. When S denotes diagnosis of breast can- 
cer among screened subjects, the absolute value of 
the statistic is minimized around ^'i = —0.20, and 
the 95% confidence interval is (—0.49,-0.06); this 
means that screening lengthens the time to breast 
cancer mortality by a factor of [exp{— (— 0.2)} — 1] * 
100% = 22% (6%, 63%) among screened subjects di- 
agnosed with breast cancer. We obtain identical re- 
sults when S denotes receiving a screen, and so the 
model estimates may also be interpreted as the effect 
of screening on screened subjects. This is so because, 
for any given value of g{Y^{'^),X} = A(*) is 
the same whether S denotes screening or cancer 
diagnosis following screening. Because screening is 
unavailable in the control arm, these estimates are 
also estimates of compiler average causal effects (Im- 
bens and Rubin, 1997). When S denotes breast can- 
cer diagnosis as the result of a screen, the corre- 
sponding point estimate (95% confidence interval) is 
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—0.76 (—1.15,-0.27); the corresponding lengthen- 
ing is 114% (31%, 216%). Thus, the effect of screen- 
ing among screened subjects diagnosed with breast 
cancer is less than the effect on the subset of these 
subjects whose diagnosis was a result of the screen- 
ing, as would be expected. 

7. UTILITY OF VARIOUS APPROACHES 

Causal modeling has multiple purposes. These 
include assisting in making decisions about possible 
interventions, predicting the results of those inter- 
ventions, and better understanding of the processes 
leading to the outcome(s) under study. We consider 
the approaches sketched above in terms of their util- 
ity for these purposes, as well as generalization of 
study results. 



7.1 Making Decisions 

Making decisions about possible interventions and 
predicting the results of those interventions are closely 
linked. Normally, one would want to choose, for any 
individual, the treatment that leads to the best pos- 
sible expected result, however that is defined. This 
decision must be made on the basis of information 
available at the time of the decision; for point expo- 
sures, that means only baseline covariates may be 
used to predict outcomes under a given treatment 
and so guide treatment decisions. 

We formalize this within a decision-theoretic frame- 
work. Let Lj(S'",y";a) denote an individual's loss 
function associated with decision a. The loss func- 
tion may be associated with the subject's principal 
stratum, observed auxiliaries or outcomes, measured 
pretreatment covariates, or other individual-specific 
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factors. To formalize this idea, let Q denote a col- 
lection of variables. We can write the expected loss 
given Q as E{Li{Sf ,Y{^ ,a)\Q} . Under the usual 
decision-theoretic framework, a decision should de- 
pend only on information available at the time of the 
decision; further, if the decision is to be based only 
on information available in the statistical models fit 
to the data, we can condition only on measured base- 
line covariates X. Under these constraints, decisions 
should be based only on comparisons of E{Li{Sf, 
Yf',a)\X} for different treatments a, and so depend 
on the joint density of the potential outcomes 
fiS.,Y.\X) only through the marginal densities 
f ls<',Y<'\X)= f{S<'\X)f{Y''\S'', X), where S = {3"}, 
a £ A and A denotes the set of possible treatments 
a. Thus, the optimal feasible decision based solely 
on data available to the study will depend on the 
marginal distributions f{S°'\X) and f{Y"'\S"',X), 
both of which are estimable using conventional meth- 
ods (the latter as described in Section 2.2); the use of 
the factorization of the marginal density f{S°',Y°') 
for estimation will be particularly important when 
is undefined for some values of 5" (e.g., with 
censoring by death). Relying solely on the marginal 
density /(y"|X) will be adequate for deciding among 
treatments if the loss does not depend on S"" or if the 
auxiliary variable is not affected by treatment and 
the loss function may be written as the sum of sepa- 
rate components associated with main and auxiliary 
outcomes [i.e., if Li{S'',Y'';a) = qiiS"") + q2(Y'') for 
some functions qi{-) and (/2(')]- If the loss is a func- 
tion of S", none of the estimands discussed in Sec- 
tion 3 is adequate by itself, as they focus only on 
the density of Y"" [some methods proposed in con- 
junction with PS (Imbens and Rubin, 1997) yield 
estimates of 7(5" |X) and so potentially could be 
used to estimate f{S°',Y"')]. In the nephrology ex- 
ample, the auxiliary outcome, ESRD, is a condition 
with severe and life-disrupting consequences (requir- 
ing dialysis or transplant) which would normally in- 
fluence the loss function. 

Suppose that the loss is not a function of S"". In 
the cancer screening example, it is arguable that 
the benefits of screening should be assessed solely 
in terms of mortality Y"", even though there are 
costs that are associated with cancer diagnosis, in- 
cluding the financial costs of administering screen- 
ing exams, patient inconvenience, and the costs of 
falsely diagnosing a subject as having cancer who 
actually does not, which can both lead to unneces- 
sary surgical procedures and have adverse psycho- 



logical impact (Brett, Bankhead, Henderson, Wat- 
son and Austoker, 2005; Barratt, Irwig, Glasziou, 
Salkeld and Houssami, 2002). If the loss is a func- 
tion of Y"" alone, optimal decisions should be based 
on f{Y"'\X); that is, we should seek to minimize 
E{LiiS^,Y-,a)\X} = E{L,{Y-,a)\X} = Jy^ Li{Y\ 
a)/(F°|A)dy". Suppose that, as is typical, Li(yj",a) 
is a monotonic function of Y^ for all subjects (e.g., 
mortality is never a desired outcome); suppose fur- 
ther that Li {Y-^ , a) does not depend on a and so 
Li{Yf,a) = Li{Yf' ,a') for all a, a' (e.g., the loss as- 
sociated with mortality does not depend on whether 
one had been screened). Then, the relative expected 
loss for an individual under one treatment relative to 
another can be assessed by comparing the densities 
f{Y-\X) and /(W I A). 

Although these quantities may be evaluated from 
a model which conditions on a post-treatment auxil- 
iary, such evaluation will typically be more involved 
than the simpler and more direct modeling of the 
marginal densities f{Y"'\X). Let Z denote a post- 
treatment auxiliary; Z can be a principal stratum or 
the single potential auxiliary S"". We can compute 
and estimate f{Y"'\X,Z) using methods discussed 
above (including the methods based on observed 
auxiliary stratification; Section 4.3); nonetheless, to 
make a decision, we must integrate out the auxiliary 
Z, as it is unknown at the time a decision is be- 
ing made. Thus, from models based on PS or single 
potential auxiliaries, we can assess the loss by eval- 
uating /(y^A) = /(l^l A, Z = z)pr{Z = z| A); 
these calculations are more involved than those based 
on stratifications based on the observed covariates 
A, where the most relevant comparison of densities 
can be read simply (or directly) from a single re- 
gression of y on A and A. Further, the increased 
number and complexity of the models required to 
evaluate the desired quantities may lead to increased 
likelihood of obtaining incorrect conclusions due to 
model misspecification. If the sharp null hypothe- 
sis of no effect is true, tests of the null using G- 
estimation of parameters in structural nested mean 
models will be (relatively) robust to model misspec- 
ification in randomized trials, because they essen- 
tially are based on intent-to-treatment tests of the 
null (Robins, 1992). 

Information on the expected value of post-treatment 
auxiliaries more generally might be of use in mak- 
ing decisions in situations in which decision mak- 
ers had a better idea of the value of these post- 
treatment auxiliaries than can be assessed from the 
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data. This could happen if, for example, a new test 
were developed that discriminates well between sub- 
jects who will and will not develop the auxiliary out- 
come. Then, effects for groups defined by an auxil- 
iary variable might be used to infer the effects of 
treatment for groups defined by this baseline vari- 
able. Additionally, finding that effects of treatment 
differ for subjects with different levels of the auxil- 
iary variable could be a useful spur to find baseline 
variables which predict the auxiliary variable well. 

For making decisions, even the expected auxiliary 
outcomes play no special role in principle. For this 
purpose, covariates which strongly modify the ef- 
fects of treatment (measured on a clinically relevant 
scale) are of greatest value. Sometimes, the expected 
auxiliary may be such an effect modifier; in the can- 
cer screening example, it is plausible that the risk 
of breast cancer would be an important modifier of 
the benefit of screening, especially if the benefit is 
assessed on the scale of difference in probability of 
mortality if screened or not. Unfortunately, we can- 
not evaluate this in the HIP study, which collected 
little covariate information. 

7. 1. 1 Treatments given over an extended period. 
Where the treatment is actually applied over an ex- 
tended period of time, estimands stratifying on post- 
treatment auxiliaries may have more immediate rel- 
evance for decision making. Suppose that the auxil- 
iary variable is measured shortly after the initiation 
of treatment. Suppose further that treatment never 
changes over the course of follow-up and that the 
effect of treatment on the ultimate outcome, mea- 
sured at the end of a fixed follow-up period, is cumu- 
lative (i.e., the effect of treatment given early during 
follow-up on this outcome is in the same direction 
as the effect of treatment later). Then, finding that 
treatment is beneficial for some groups defined by 
post-treatment variables and harmful for others will 
lead to recommendations to discontinue treatment 
for the group for whom treatment is harmful. 

To illustrate and formalize this, consider elabo- 
rating model (4) to model the effect of the different 
components A/^ of treatment received at different 
times k,k = l,...,K. Let A = (Ef=i Ak)/K be the 
average value of treatment received during follow- 
up. The model 

(7) Y° = Y -A{l-S)^o-AS^i 

is indistinguishable from model (4) when treatment 
for all subjects remains constant over time; if the 



model is correct and S is assessed early in follow- 
up, one would then be justified in making treatment 
decisions beyond k based on the value of S measured 
by k. Such inference is, however, strongly dependent 
on modeling assumptions. To see this, consider the 
model 

(8) Y^ = Y-Ai{l-S)^o-A,S^i, 

which specifies that the only treatment which af- 
fects the outcome is that received during the first 
period. When treatment A^, remains constant over 
time, this model is indistinguishable in the data 
from model (7); nonetheless, one model will suggest 
that treatment may be beneficial beyond k, whereas 
the other will not. 

7.2 Explanatory Analyses 

Another possible use of analysis stratifying on post- 
treatment auxiliaries is explanatory. In scientific work, 
one typically wants to explain the data in ways which 
will lead not only to an understanding of the data 
themselves but also to generalization to other set- 
tings (often beyond the sampling frame of the study). 
In this section, we consider the utility of analysis 
stratifying on post-treatment auxiliaries for such ex- 
planation and contrast it with approaches which 
concentrate on causal mechanism. 

7.2.1 Two types of explanation: effect modifica- 
tion vs. causal mechanism. Effects which vary by 
post-treatment auxiliaries can be used to explain 
observable differences in outcomes between groups 
receiving different treatments (White and Goetghe- 
beur, 1998). For example, in the HIP study, an ef- 
fect of screening ^'i of —0.2 for screened subjects 
diagnosed with breast cancer (or —0.76 for screened 
subjects diagnosed due to the screen), together with 
no screening effect in the remaining subjects, can 
explain the differences in survival between the ran- 
domized groups. We say that there is effect mea- 
sure modification (Rothman and Greenland, 1998) 
here by the post-treatment auxiliary (diagnosis af- 
ter screen or diagnosis due to screen) because the 
measure of treatment effect differs between the dif- 
ferent groups defined by the auxiliary [i.e., ^ 
in (4)]; this variation in effects can be used to ex- 
plain the differences between the groups. 

One might be tempted to conclude that cancer 
diagnosis (or its correlates) participates in the pro- 
cesses leading to improved survival and perhaps in- 
teracts causally with the treatment of interest. Such 
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mechanistic inference will often be suggestive but 
is fraught with pitfalls (Thompson, 1991). In par- 
ticular, the concept of effect measure modification 
(a.k.a. statistical interaction) differs from mechanis- 
tic interaction in that effect measure modification 
does not require that the modifier itself be a variable 
which can be modified directly, whereas mechanistic 
interaction requires the modifier to be the subject of 
intervention. Thus, one need not speak of the causal 
effect of an effect modifier. 

Effect modification by post-treatment auxiliary vari- 
ables provides less satisfactory and less satisfying 
explanations of observed associations when divorced 
from the concepts of causal mechanism and causal 
intermediates. To see this, consider a scenario in 
which the effect of aggressive blood pressure man- 
agement on MI is greater for subjects who subse- 
quently develop ESRD [e.g., \g{E{Y^\A = 1,5=1)}- 
g{E{Y^\A = 1, 5 = 1)}| > \g{EiY^A = 1,5 = 0)}- 
g{E{Y^\A = 1, 5 = 0)}| for some monotone link func- 
tion g{-); typically g(y) is either y (the identity link, 
leading to risk differences), ln(y) (the log link, lead- 
ing to risk ratios) or logit(y) (the logit link, leading 
to odds ratios)]. Contrast the following four state- 
ments: 

1 . The effect of aggressive treatment of blood pres- 
sure on MI is greater for people who subsequently 
develop ESRD (observed auxiliary stratification) , 
with the effects in the different subgroups ex- 
plaining the overall difference between subjects 
treated aggressively and those not. 

2. The effect of aggressive treatment of blood pres- 
sure on MI is greatest for people who would de- 
velop ESRD only if not treated aggressively (prin- 
cipal stratification). 

3. Aggressive treatment of blood pressure prevented 
the development of ESRD in some subjects (one 
principal stratum). This, in turn, prevented the 
development of MI for some subjects. Thus, ESRD 
mediated in part the effect of aggressive manage- 
ment of blood pressure on MI. 

4. Aggressive treatment of blood pressure prevents 
the development of ESRD in some subjects. This, 
in turn, prevents the development of MI for some 
subjects. Thus, ESRD mediates in part the effect 
of aggressive management of blood pressure on 
MI. 

The last two statements attempt to provide some 
understanding of the causal mechanisms leading from 



blood pressure treatment to MI; these are often con- 
sidered in terms of direct and indirect effects (Pearl, 
2001; Robins and Greenland, 1992; Robins, 1999). 
The first two do not attempt such explanation. State- 
ments 3 and 4 are identical except for the tense. 
Statement 3 is an attempt to explain what has hap- 
pened; statement 4 refers to more general causal pro- 
cesses and in principle is an attempt to predict the 
future and so is more ambitious. 

In general, we prefer statements 3 and 4. State- 
ments 1 and 2 are not, in general, useful for mak- 
ing treatment decisions, and the quality of their ex- 
planations of the data is poor. Where warranted 
based on appropriate subject-matter considerations, 
statements 3 and 4 provide informative explana- 
tions. At the risk of generalizing beyond the data at 
hand, statement 4 makes general statements about 
causal processes which are then potentially testable 
in other data. An important role of science is to 
extrapolate from one's data and make predictions 
which may be testable in other data or designs; this 
is most easily fostered by considering causal mech- 
anisms. This important scientific goal is sometimes 
fostered by considering effects of auxiliary variables 
that are not under the direct control of the inves- 
tigator in the given study; Rubin (2004) presents a 
somewhat contrary view. 

In the nephrology example, it is meaningful to 
consider the effect of ESRD on MI; this accords 
nicely with common usage. The effect can be ap- 
proximated by comparing what would happen to 
someone who has or develops ESRD with what would 
have happened had that person received a trans- 
plant from a genetically identical person (an iden- 
tical twin or identical triplets) who did not have 
ESRD. Even though this experiment could rarely, if 
ever, be carried out, it provides a useful approach for 
defining the effect in terms of a thought experiment. 

DAGs provide a nice intuitive representation of di- 
rect and indirect effects; the ideas of potential out- 
comes and counterfactuals allow these ideas to be 
made more precise. Figure 3 shows causal relation- 
ships among the variables. In this graph, the path 
A — >■ 5 — > y represents the indirect effect of A on Y 
(i.e., that part of the effect of A that is mediated by 
the specified variable S; it is necessary to specify the 
auxiliary variable(s) S to define what is meant by in- 
direct and direct effects) , and A^Y represents the 
direct effect of A on y (i.e., that part of the effect 
not mediated by S). Graphs like this are sometimes 



CAUSAL EFFECTS AND AUXILIARY OUTCOMES 



19 



known as path diagrams and have been used to jus- 
tify the use of hnear models for multivariate normal 
data (Pearl, 2000). The path-analytic approach suf- 
fers from the lack of a nonparametric definition of 
causal effects and generally unrealistic assumptions 
of multivariate normality; further, it does not ex- 
tend naturally to settings with interactions among 
the variables or to nonlinear models. 

The ideas of potential outcomes may be used to 
define direct, indirect and joint effects of treatment. 
Let y denote the (continuous) outcome one would 
see for a given subject at level a of the main treat- 
ment of interest (e.g., management of hypertension) 
and level s of the auxiliary outcome (e.g., ESRD). 
Underlying the notation is the assumption that the 
auxiliary outcome is in some sense manipulable. The 
notation has been used in conjunction with PS (An- 
grist, Imbens and Rubin, 1996) as well as discussions 
of direct and indirect effects. 

The direct effect of the main treatment A con- 
trolling for an auxiliary variable S may be defined 
as the contrast of Y"" and for an individ- 

ual or a group; that is, the direct effect contrasts 
the outcome that would be seen under two different 
levels of the primary treatment, physically manipu- 
lating the auxiliary variable to a given level s. Direct 
effects are not uniquely defined; there are separate 
direct effects for each level of the auxiliary variable 
s. 

There are, in fact, two separate types of direct 
effects (Pearl, 2001; Robins and Greenland, 1992; 
Robins, 2003), depending on the nature of the ma- 
nipulation of the auxiliary variable. If the auxiliary 
variable is set to a prespecified level s, the resulting 
contrasts are known as "controlled" direct effects. 
If the auxiliary variable is set to the level it would 




Fig. 3. A directed acyclic graph representing the relations 
among the variables in the example of Section 7.2. 



have reached at some reference level a* of the treat- 
ment, the effects are known as "natural" direct ef- 
fects; natural direct effects are contrasts of Y""^'^" 
and Y"-^'^" , a* G {a?,a^}. For a* = a°, the natural 
direct effect is a contrast of the potential outcome 
that would have been seen had the main treatment 
been set to level a} but the auxiliary variable set to 
the level 5° it would have taken had the subject 
been given level with the potential outcome that 
would have been seen had the subject been given 
level aP of the main treatment. 

Although there is no general definition of con- 
trolled indirect effects, natural indirect effects are 

nonparametrically defined as contrasts of y"*''^" 

and '^"^ ; that is, for a* = , the indirect effect 
is a contrast of what would have happened had a 
subject been given treatment aP but had his or her 
auxiliary variable set to the level 5" it would have 
attained had he or she been given a} with what 
would have happened had the subject been given 
. Statements 3 and 4 above may be understood 
as speaking about the natural indirect effects of ag- 
gressive management of blood pressure. 

7.2.2 Contrasting models for contrasting explana- 
tions. We present here a simple semiparametric model 
which allows us to contrast the ideas of mediation 
with those presented previously, in particular, with 
the modification of treatment effect by post-treatment 
covariates, as in (4). We consider a continuous out- 
come variable Y . A simple model for the joint effects 
of hypertension management and ESRD is 

(9) y'^'^ = y°'° + a7i + s72 + as73. 

In this model, 71 represents the controlled direct ef- 
fect of aggressive treatment of blood pressure (hold- 
ing S at 0), 72 represents the effect of modification 
of ESRD and 73 represents an interaction between 
ESRD and aggressive treatment of blood pressure. 
Under an assumption of monotonicity of the effect 
of treatment ^ on 5 (e.g., < 5*"), the parameter 
7 = {71,72,73} is related to the effects of aggressive 
treatment of blood pressure on principal strata as 
follows: 

for subjects doomed to develop ESRD {S^ = = 
1), the effect of aggressive treatment {Y^ — Y^ = 

for subjects immune to ESRD {S^ = 5*^ = 0), the 
effect of aggressive treatment is 71; 
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for subjects for whom treatment prevents ESRD, 
the effect of aggressive treatment is 71 — 72 • 

In this model, statement 2 (now relating to a con- 
tinuous cardiovascular outcome) that the beneficial 
effect of aggressive treatment of blood pressure is 
greatest for people who would develop ESRD only 
if not treated aggressively is implied by 72 > 0, 73 > 
—72 (assuming small values of Y are better out- 
comes). Further, if 72 > and aggressive treatment 
and ESRD do not interact in affecting the outcome 
(i.e., 73 = 0), then statement 2 is true, as is state- 
ment 4. 

Further, the effect of treatment among treated 
subjects who do not develop ESRD is 

E{Y^\A = 1,5 = 0)- E{Y^\A = 1,5 = 0) 

= E(yi'^'-yO'^V = 1,5^ = 0) 

= ^pr(5° = s|^ = 1,5^ = 0) 

s 

.EiY^^s^ -Y^^''\A = l,S^ = Q,S'^ = s) 

= 7i-pr(50 = l[5i = 0)72, 

and the effect of treatment among the treated who 
develop ESRD is 71 +73. Thus, ^'o in (4) equals 
71 -pr{S^ = 1\S^ = 0)72, and *i = 71 +73. 

In a randomized trial of A, the parameters in 
model (9) are identified from the data under two 
types of assumptions: 

1. The assumption that the initial treatment A is 
randomized, along with the modeling assumptions 
inherent in (9) (Robins and Greenland, 1994; Ten 
Have et al., 2004). In this approach, the modeling 
assumptions are not fully testable, even in very large 
studies. This approach is similar to that proposed 
above for observed auxiliary stratification and un- 
der some conditions is essentially identical. For G- 
estimation, the approach requires the same number 
of estimating equations as the number of free param- 
eters. Under the above normality and homoscedas- 
ticity assumptions and the assumption that /(y°'*^|X, 
A, S^) = /(y^'^lX), the optimal function is the vec- 
tor function g{Y^'"{'y),X} = e{-f){l, E{S\X, ^ = 1) - 
E{S\X,A = 0),E{S\X,A = 1)}^. We now require 
the covariate vector to predict not only the level of 
the auxiliary variable among the treated [as was true 
for model (4)], but also to predict the effect of the 
primary treatment on the auxiliary, and we require 
the three elements {l,E{S\X,A = 1) - E{S\X,A = 



0),E(S\X,A = 1)} to not be collinear; this often re- 
quires at least two covariates. Restrictions on the 
parameter space can result in a smaller vector of 
estimating functions; typically, the restrictions are 
either that there is no interaction between A and 
S (i.e., 73 = 0) (Robins and Greenland, 1994; Ten 
Have et al., 2004), or that some parameters play no 
role in explaining the data and so are not estimable. 
To explain the latter, suppose that one level of a bi- 
nary auxiliary variable S does not occur for some 
level of treatment A; for example, pr{S = 1\A = 
0) = 0; in the HIP study, this is true both when 
S denotes screening and when S denotes diagnosis 
due to screen. Under this condition, S = AS for all 
subjects, and one cannot simultaneously estimate 
both the main effect of S (72) and its interaction 
with A (73) in (9). Under this model, for a binary 
treatment and auxiliary variable, the effect of the 
auxiliary treatment S (72) is the same as the ef- 
fect of the main treatment A among treated sub- 
jects with 5 = l('J'i). Thus, the effect of assignment 
to screening for screened women who are diagnosed 
due to the screen [^1 in (4)] might also be inter- 
preted as the effect of diagnosis or early detection 
by screen [72 in (9)], and the effect of assignment 
to screen among subjects who received the screen 
might also be interpreted as the effect of screen- 
ing. 

2. The assumptions above, plus the assumption 
the assignment of the auxiliary variable S is ig- 
norable; that is, pr{S\X, A, L, Y^'') = pr{S\X, A, L), 
where X refers to covariates measured at baseline 
and L to covariates measured after baseline but be- 
fore S (Robins and Greenland, 1992). Under the ad- 
ditional but untestable ignorability assumption, the 
structural model (9) is fully testable. 

The DAG in Figure 3, which corresponds to a ran- 
domized trial of the main treatment A, is consis- 
tent with assumption 1 but not assumption 2. The 
assumption of initial randomization is justified be- 
cause there are no arrows into A; further, the arrow 
from X to 5 is consistent with an association be- 
tween X and S among treated subjects (also part 
of assumption 1). Assumption 2 is not justified, be- 
cause the arrows from U to S and U to Y imply 
that the effect of 5 on y is confounded. 

The joint effects of A and S are, in principle, iden- 
tifiable without making modeling assumptions in an 
experiment in which both factors (e.g., blood pres- 
sure treatment and kidney function) are experimen- 
tally controlled and both treatments assigned ran- 
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domly. Kidney function might be controlled experi- 
mentally by surgery. Ethical considerations preclude 
performing such experiments in humans; animal ex- 
perimentation could, in principle, be used to learn 
about the direct effect of blood pressure control. 

Model (9) is fairly simple and might not be a 
faithful representation of the real-world situation. 
In particular, the effects A and S in (9) might vary 
with observed covariates (e.g., baseline covariates 
X or observed treatment levels A) or latent covari- 
ates (e.g., the principal strata). Approaches for deal- 
ing with these issues (Robins, 1999; Robins, Rot- 
nitzky and Scharfstein, 2000; van der Laan and Pe- 
tersen, 2004; Robins, 2003) are beyond the scope of 
this paper. 

7.2.3 Explanation and generalization. Explanatory 
analysis can serve to explain the findings in the data 
at hand in ways which may generalize further to 
other populations or settings, a more ambitious task 
requiring further assumptions. Generalizability be- 
yond one's data is enhanced if the relations found in 
one's data are likely to hold in other settings. Two 
strategies for this are: 

1. Obtaining a finer understanding of the causal 
processes leading from treatment to outcome. With 
sufficiently detailed and accurate understanding, al- 
tering one component in the causal system may not 
change the other causal mechanisms in the system; 
this may allow better prediction of the effects of 
interventions in other settings (Pearl, 1995, 2000). 
Partitioning effects into component direct and indi- 
rect effects is an attempt to obtain understanding 
in those terms. 

2. Estimating the effects of treatment for homoge- 
nous subgroups in which causal effects in the popu- 
lation under study can be assumed to be similar to 
like groups in other populations. This can be done 
by estimating the effects of treatment for subgroups 
defined by observable variables X or S", or by the la- 
tent principal strata S_. Principal stratification can 
lead to a finer partition of the population than strat- 
ification on observed variables alone, and so can po- 
tentially lead to more generalizable conclusions. 

These strategies are not mutually exclusive. Struc- 
tural nested models have been formulated in a way 
which recognizes possible differences in effect in sub- 
groups not identifiable at baseline (Robins et al., 
1992; Robins, Rotnitzky and Scharfstein, 2000); our 
approach to using G-estimation for estimating treat- 
ment effects for strata defined by observed post- 



treatment auxiliaries allows finer stratification. Fur- 
ther, PS approaches have been used in conjunc- 
tion with approaches which allow one to consider 
partitioning causal effects (Angrist, Imbens and Ru- 
bin, 1996). 

8. FUTURE WORK: EXTENSIONS OF 
ESTIMANDS AND ESTIMATION 

We conclude the paper with a discussion of various 
additional complications which may arise in applied 
problems, and outline possible directions for future 
work in this context. 

8.1 Multiple Types of Auxiliary Variables 

In considering conditioning on a post-treatment 
auxiliary S, we have so far considered settings in 
which the outcome of interest Y is meaningful for 
all levels of S. Sometimes, the outcome of interest 
Y is not defined at some levels of S. Most notably, 
if S is (or includes) an indicator of vital status, the 
outcome Y may not be defined meaningfully for sub- 
jects who die, and so the effect of ^ on y (i.e., the 
contrast of Y^ and Y^) will be defined only for sub- 
jects who would live whether or not treated (i.e., 
the principal stratum in which = = 1, where 
S = 1 indicates being alive) . 

In the nephrology example, some subjects may de- 
velop ESRD only if their blood pressure is treated 
aggressively, because aggressive treatment may pre- 
vent mortality. Thus, a simple version of the mono- 
tonicity assumption stating that aggressive treat- 
ment never causes ESRD will not be true. Because 
of countervailing biases, the direction of bias of the 
naive approach (comparing among treated and un- 
treated subjects who develop ESRD) may not be 
predictable. 

Where there is a composite auxiliary variable (e.g., 
death and ESRD), there are several possible esti- 
mands. Let Si denote an auxiliary variable for which 
outcomes are well defined at all values of the vari- 
able (e.g., ESRD), and let 5*2 denote an auxiliary 
variable for which outcomes are well defined only for 
one value of the variable (e.g., death; 52 = 1 if a sub- 
ject lives, if dead). Because other outcomes are not 
meaningfully defined for people who die (Frangakis 
and Rubin, 2002; Kalbfieisch and Prentice, 2002), 
it is difficult to ignore mortality in defining causal 
estimands for this setting. This leaves fewer options 
for dealing with mortality: looking at the effect of 
treatment on the conditional distribution of Y given 
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survival, and PS. Within the framework of PS for 
mortality, almost any of the solutions above for the 
other auxiliary variable (ESRD) may be applied for 
defining causal estimands. The possible estimands 
include comparisons of the following expectations 
for different levels a of treatment: 

1. Eiy'lS^ = 1,S^ = 1), the effect of treatment for 
subjects who would not die whether or not treated 

2. E{Y"-\S2 =1), the effect of treatment on the ex- 
pectation of MI for subjects who would not die 
under that treatment. As before (Section 2.2), 
this is not a comparison of outcomes for a com- 
mon set of subjects. This estimand is most eas- 
ily understood in conjunction with the effect of 
treatment on survival [i.e., comparisons of i?(S'2 )]. 
This and the previous estimand are not defined 
in terms of ESRD. Estimands which use ESRD 
include: 

3. Ei^lSf = l,S^ = 1), the effect of treatment on 
the distribution of MI among subjects who would 
be alive and have ESRD under that treatment; 

4. EiV^lS^ = 1, Si = l,S^ = 1, = 1), the effect of 
treatment among subjects who, under both treat- 
ment levels, would be alive and develop ESRD 
(full or dual PS); 

5. Ei^lSl = 1,S^ = l,Sl = 1), the effect of treat- 
ment on subjects who would live whether or not 
treated and develop ESRD if treated (single po- 
tential auxiliary stratification for in conjunc- 
tion with PS for 52); 

6. EiVlSi = 1,5^ = 1,^2^ = l,A=l), the effect of 
treatment on subjects who would live whether or 
not treated, who are treated and develop ESRD 
(observed auxiliary stratification); and 

7. E{Y''\E{S^\X),S^ = l,Sl = l,A = l}, the effect 
of treatment on subjects at a particular risk of 
ESRD who would live whether or not treated. 

We expect that the likelihood-based or sensitivity 
analysis methods mentioned in Section 4.4 could be 
extended to deal with these issues and estimands. 

8.2 Other Extensions 

Many real- world problems, including studying the 
effect of the aggressive management of blood pres- 
sure on renal patients, involve problems that are 
substantially more complicated than those consid- 
ered here. Complications arise from the fact that all 
three main variables under study (i.e., treatment A, 
auxiliary variable S and outcome Y) may be more 
complex than the simple binary variables we have 



discussed. The additional complexity of each may 
require refinement or redefinition of the effects un- 
der study as well as of the methods used to estimate 
them. 

The auxiliary variable S of interest may, in fact, be 
measured repeatedly over time. In the renal study, 
the time ESRD develops will be noted; in the HIP 
study, the time of breast cancer diagnosis will be 
noted. For such a failure-time variable S, one could 
define effects of treatment based on the actual failure- 
time [e.g., compare E(Y^\S^ , S^) for different a], or 
on whether the failure-time exceeds some threshold 
[e.g., compare EiY^mS^ > s),I{S^ > s)} for dif- 
ferent o]. For failure-time outcomes, as in the HIP 
study, we can revise the causal estimand to account 
for the timing of changes in the auxiliary variable. 
Let S{t) = 1 after a subject with breast cancer is 
diagnosed by screen, otherwise (a modification 
of the third definition of S in Section 6). A ver- 
sion of the accelerated failure-time model with time- 
varying covariates [a generalization of (4)] is = 
Jq exp{AS{t)^}dt (Cox and Oakes, 1984; Robins, 
1992; Robins et al., 1992), where T is the subject's 
failure time and the failure-time if the subject 
had not been screened. The causal parameter ^ now 
represents the effect of screening in shortening time 
from cancer diagnosis to mortality among screen- 
diagnosed subjects. G-estimation could be applied 
for estimating parameters in this model. 

Similarly, the outcome variable Y may be a re- 
peated measures outcome. This allows many options 
for the time that the value of the auxiliary variable is 
measured: one may be interested in defining effects 
of treatment conditional on the value of the auxil- 
iary variable at the time the outcome Y is measured, 
or one year previously, or on the time of failure for a 
failure-time auxiliary such as ESRD. If the outcome 
is defined at all levels of the auxiliary, all options are 
potentially meaningful. We expect that any of the 
methods for defining and estimating causal quanti- 
ties sketched above (Sections 3 and 4) could apply. 

The study exposure or treatment is often not a 
simple scalar but may vary over time, and the joint 
effects of treatments received at different points in 
time may be of interest; this is true of observa- 
tional studies of the effect of aggressive manage- 
ment of blood pressure, in which therapy is pro- 
vided over an extended period and changes may be 
made over that time. Robins (Robins, Rotnitzky and 
Scharfstein, 2000) has provided a general approach 
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to defining the component and joint effects of treat- 
ments given over an extended period; tliis approacli 
would need to be generalized to allow the effects of 
a component of treatment or of a treatment plan 
to depend on auxiliary variables subsequent to the 
treatment. For defining the component effect of a 
treatment At applied at t, we might want to allow 
the effect to depend on St+i, for example. Alter- 
natively, we might want to allow the joint effect of 
the component treatments in a prespecified regime 
to depend on the level of an auxiliary variable that 
applies at a fixed point in time after the start of 
follow-up. 
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